ON FUNCTIONAL RELATIONS FOR WHICH THE COEFFICIENT OF CORRELATION IS ZERO.

By H. L. Rietz, University of Iowa. 



The recent papers of Reed* and Harris in these Publications 
have brought to my mind some simple cases in which a corre- 
lation coefficient is zero although the two variables are mathe- 
matical functions of each other represented by certain simple 
types of continuous curves. 

Those who have studied critically the theory of correlation 
are perhaps aware of the limitations of this valuable theory 
as well as they are aware of its useful applications. But those 
who are making applications without fundamental knowledge 
of the method are apt to overlook the limitations in applying 
a summary method of quantitative description such as is 
provided by the correlation coefficient. 

The following rather striking illustration should serve to 
make clear a limitation on the generality of the correlation 
coefficient as a measure of correlation. Suppose we had 
observed as corresponding variates the coordinates x and y in 
a simple harmonic motion given by 

$$y = \cos \lambda x \tag{1}$$

in the interval from $x = 0$ to $x = 2\pi/\lambda$.

That is to say, let us consider the case in which we have given the
corresponding values of $x$ and $y$ in (1) for a set of $n$ $x$'s in the
interval from $0$ to $2\pi/\lambda$. Let $x_t$ and $y_t$ ($t = 1, 2, 3, \ldots, n$)
be corresponding variates; $\bar{x}$ and $\bar{y}$ the mean values of the
$x$'s and $y$'s respectively; and $\sigma_x$ and $\sigma_y$ the standard
deviations. Assume further that the points $(x_t, y_t)$ are symmetrical
about the middle point of the interval, so that $\bar{x} = \pi/\lambda$. Then
$\bar{y} = 0$. Furthermore, it is clear from symmetry of points
about $(\bar{x}, \bar{y})$ that the correlation coefficient

$$r = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{n \sigma_x \sigma_y} = \frac{\sum_{t=1}^{n} \left(x_t - \frac{\pi}{\lambda}\right)(y_t - 0)}{n \sigma_x \sigma_y} = 0. \tag{2}$$

* Reed, Vol. 15, pp. 670-684, 1917; Harris, J. Arthur, loc. cit., pp. 803-805.

Similarly, if we deal with a continuous distribution of points
along the curve $y = \cos \lambda x$, where $k\,dx$ ($k$ constant) is the
frequency of values in the interval $dx$, we have

$$r = \frac{k}{s^2} \int_0^{2\pi/\lambda} y \left(x - \frac{\pi}{\lambda}\right) dx = 0,$$

where $s^2$ is the geometrical mean of the second moments of the
distribution about the two lines $x = \pi/\lambda$ and $y = 0$. That is to
say, we obtain a correlation coefficient $r = 0$ for a case where
the relationship is given by the function $y = \cos \lambda x$.

If the points were distributed uniformly with respect to
intervals $\Delta x$ in a thin band along the curve $y = \cos \lambda x$ instead
of being exactly on the curve, $r$ would be nearly zero.
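
The two statements above can be checked numerically. The following is a minimal sketch, not part of the original argument: the value of lambda, the number of points, the width of the band, and the helper name `pearson_r` are all assumptions chosen for illustration. The points are placed at the midpoints of equal subintervals, so that they are symmetrical about $x = \pi/\lambda$.

```python
import math
import random

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of paired values."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

lam = 2.0                      # assumed value of lambda
n = 1000                       # assumed number of points
period = 2 * math.pi / lam     # length of the interval 0 to 2*pi/lambda

# Midpoints of n equal subintervals: symmetrical about x = pi/lambda.
xs = [(t + 0.5) * period / n for t in range(n)]
ys = [math.cos(lam * x) for x in xs]
r_exact = pearson_r(xs, ys)

# Thin band: each y is displaced by a small random amount (assumed half-width 0.05).
rng = random.Random(0)
ys_band = [y + rng.uniform(-0.05, 0.05) for y in ys]
r_band = pearson_r(xs, ys_band)

print("points on the curve: r =", r_exact)
print("points in a thin band: r =", r_band)
```

For points exactly on the curve the computed $r$ differs from zero only by rounding error; for the band it is small but not exactly zero.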

These results do not at all constitute a reflection on the 
use of the correlation coefficient for many purposes, but they 
do show that it is improper to infer that no correlation* exists
because r is equal to, or nearly equal to, zero.

West† states that "if the regression curve is of a certain
shape the value of r will be very small even though practically 
perfect correlation exists." It is clear from the above illus- 
tration that it would be justifiable to strengthen this state- 
ment by saying that if the regression curve is of a certain 
shape the value of r would be zero in certain cases even when 
the one variable is a certain trigonometric function of the 
other. 

Instead of the distribution given by $y = \cos \lambda x$, we may take
the more general case $y = f(x)$ in the interval $-a$ to $a$, where
$f(x)$ is a single valued function symmetrical about the line
$x = 0$. When the distribution of the points in equal intervals
$\Delta x$ is uniform in the sense prescribed above for $y = \cos \lambda x$, we
obtain for $y = f(x)$ also the result $r = 0$.

* Cf. Yule, Proc. of Royal Society, Vol. 60, p. 477, 1897. 
† Introduction to Mathematical Statistics, pp. 84-85.
For values of $x$ in the interval $-a$ to $a$, a class of functions
for which $r = 0$ is given by $y = f(x)$ subject to the condition

$$\int_{-a}^{a} x y \, dx = 0.$$



Hence, it is simply the first moment about the y-axis of areas 
bounded by y=f(x), the x-axis, and ordinates x= —a and x=a 
that must vanish. This condition is satisfied by a variety of 
simple functions. 
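
As a rough numerical check of this condition, the sketch below estimates the first moment $\int_{-a}^{a} x f(x)\,dx$ by the midpoint rule for a few trial functions. The choice $a = 1$, the trial functions, and the helper name `first_moment` are assumptions made for illustration only. The cubic $x^3 - \tfrac{3}{5} a^2 x$ is included because it satisfies the condition without being symmetrical about the y-axis, while $y = x$ is included as a case in which the moment does not vanish.

```python
import math

def first_moment(f, a, n=20000):
    """Midpoint-rule estimate of the integral of x * f(x) over [-a, a]."""
    h = 2 * a / n
    total = 0.0
    for i in range(n):
        x = -a + (i + 0.5) * h
        total += x * f(x)
    return total * h

a = 1.0  # assumed half-width of the interval

# Hypothetical trial functions, keyed by a descriptive label.
candidates = {
    "x^2": lambda x: x * x,
    "cos(x)": math.cos,
    "x^3 - (3/5) a^2 x": lambda x: x ** 3 - 0.6 * a * a * x,
    "x": lambda x: x,
}

moments = {name: first_moment(f, a) for name, f in candidates.items()}
for name, m in moments.items():
    print(name, m)
```

The first three moments are zero to within the error of the quadrature, so each of those functions gives $r = 0$ under a uniform distribution of $x$; the last is approximately $2/3$, as expected for a linear relation.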

Harris holds that Reed put too much emphasis on the 
importance of testing the linearity of regression in the early 
stages of a correlation study. With this position taken by 
Harris, the writer is in agreement if the question is one merely 
of showing the existence of correlation rather than one of 
showing the degree of correlation; for, if regression is not 
linear, the value of r turns out to be smaller than the correla- 
tion ratio — a function appropriate to describe correlation 
without a limitation of linear regression. That is to say, the 
use of r does not lead us to infer a greater degree of correlation 
than exists, but in cases of non-linear regression it may lead us
to infer a smaller degree of correlation than exists. 

The interpretation of the significance of differences between 
two correlation coefficients cannot go far until careful inquiry 
is made into the form of the regression curve. The prediction 
of the mean value of y that corresponds to an assigned value 
of x is likely to be valuable only when the form of the regression 
curve is known. 

Let us consider next the application of the correlation ratio
instead of the correlation coefficient to a distribution of points
given by a single valued function such as $y = \cos \lambda x$. By
definition, the correlation ratio of $y$ on $x$ is

$$\eta_y = \frac{\sigma_{\bar{y}_x}}{\sigma_y},$$

where $\sigma_{\bar{y}_x}$ is the standard deviation of the means of arrays of
$y$'s that correspond to equal assigned intervals $\Delta x$ of the
variable $x$, when the square of the deviation $\bar{y}_x - \bar{y}$ of the mean
$\bar{y}_x$ of any array from the mean $\bar{y}$ of the total population is
weighted with $n_x$, the number in the array.
Consider the case of a set of $n$ points on the curve $y = \cos \lambda x$
distributed so that the same number of points are found in
equal intervals $\Delta x$. When each $\Delta x$ is decreased so as to
contain only one point, the mean of each array coincides with the
single point in it, and it is clear that

$$\eta_y = 1.$$

Similarly, when we assume a uniform continuous distribution
of points such that $k\,dx$ ($k$ constant) gives the frequency
for the interval $dx$, we have, when $dx$ is decreased indefinitely,

$$\eta_y = 1.$$

Thus, $\eta_y$ is an appropriate measure of the dependence of
$y$ on $x$.
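
The passage to the limit can be illustrated with a short computation. This is only a sketch under stated assumptions: the value of lambda, the number of points, the interval counts, and the helper names `correlation_ratio` and `eta_y` are not from the original. The correlation ratio is computed as the weighted standard deviation of the array means divided by the standard deviation of all the $y$'s, and the intervals $\Delta x$ are made successively narrower.

```python
import math

def correlation_ratio(values, arrays):
    """Weighted s.d. of array means divided by the s.d. of all values.

    `arrays` gives, for each value, the label of the array it falls in.
    """
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for v, label in zip(values, arrays):
        groups.setdefault(label, []).append(v)
    # Squared deviation of each array mean, weighted by the number in the array.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)

lam = 2.0                      # assumed value of lambda
n = 1200                       # assumed number of points
period = 2 * math.pi / lam

xs = [(t + 0.5) * period / n for t in range(n)]
ys = [math.cos(lam * x) for x in xs]

def eta_y(num_intervals):
    """Correlation ratio of y on x for equal intervals of x."""
    width = period / num_intervals
    labels = [min(int(x / width), num_intervals - 1) for x in xs]
    return correlation_ratio(ys, labels)

etas = {m: eta_y(m) for m in (4, 12, 120, 1200)}
for m, e in etas.items():
    print(m, "intervals: eta_y =", e)
```

With only four intervals the ratio is about 0.90; it rises as the intervals narrow and equals 1 (to rounding) when each interval contains a single point.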

Let us find next the correlation ratio of $x$ on $y$ for the case,
rather artificial and freakish from a statistical standpoint, of
looking upon $x$ as a function of $y$ for the distribution of sets of
points described above. Thus,

$$x = \frac{1}{\lambda} \arccos y$$

is a two valued function of $y$ in the interval $y = -1$ to $1$, when
$x$ is restricted to the interval $0$ to $2\pi/\lambda$. If we construct arrays
of $x$'s that fall into intervals $dy$, it is obvious from symmetry
that $x = \pi/\lambda$ is the line of regression, all the means falling
exactly on this line. But for the entire distribution, $\bar{x} = \pi/\lambda$.

Hence, it is clear that the correlation ratio of $x$ on $y$ gives

$$\eta_x = 0.$$

Thus, $\eta_x = 0$ although $x$ is a certain well known double
valued mathematical function of $y$. Such an extreme case
seems to be of value only in showing the logical character of
the correlation ratio.
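
A numerical sketch of this case follows, again under assumptions that are not in the original (the value of lambda, the number of points, the number of intervals of $y$, and the helper name `correlation_ratio`). The $x$'s are grouped into arrays according to equal intervals of $y$, and the correlation ratio of $x$ on $y$ is computed from the array means.

```python
import math

def correlation_ratio(values, arrays):
    """Weighted s.d. of array means divided by the s.d. of all values."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for v, label in zip(values, arrays):
        groups.setdefault(label, []).append(v)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)

lam = 2.0                      # assumed value of lambda
n = 1200                       # assumed number of points
period = 2 * math.pi / lam

# Points symmetrical about x = pi/lambda on the curve y = cos(lambda * x).
xs = [(t + 0.5) * period / n for t in range(n)]
ys = [math.cos(lam * x) for x in xs]

# Arrays of x's formed from equal intervals of y between -1 and 1.
num_intervals = 20             # assumed number of intervals of y
width = 2.0 / num_intervals
y_labels = [min(int((y + 1.0) / width), num_intervals - 1) for y in ys]

eta_x = correlation_ratio(xs, y_labels)
x_bar = sum(xs) / n

print("mean of x =", x_bar, " pi/lambda =", math.pi / lam)
print("eta_x =", eta_x)
```

Each array contains the two branches of the double valued function in equal numbers, so every array mean falls at $\pi/\lambda$ and the computed $\eta_x$ is zero apart from rounding.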

But $\eta_x = 0$ does even in this case give the correct summary
result in answer to the question as to the extent to which mean
values of x tend to arrange themselves along a single curve,
known as the curve of regression of x's on y. But this question
and answer are of little or no value when instead of an
arrangement along a single curve, we have arrangements of 
points along two curves as cited in the illustration. The main 
point to be kept in mind in this connection is that the correla- 
tion ratio gives a summary description of the deviations of the 
points from a single valued function. 



